Summary
Data Scientists need programming, mathematics, and database skills, many of which can be gained through self-learning.
Companies recruiting for a Data Science team need to understand the variety of different roles Data Scientists can play, and look for soft skills like storytelling and relationship building as well as technical skills.
High school students considering a career in Data Science should learn programming, math, and databases, and, most importantly, practice their skills.
Data Science is the field of exploring, manipulating, and analyzing data, and using data to answer questions or make recommendations.
As Data Science is not a discipline traditionally taught at universities, contemporary data scientists come from diverse backgrounds such as engineering, statistics, and physics.
The use cases for deep learning include speech recognition and classifying images at a large scale.
According to Dr. White, someone joining a data science team first needs these skills: basic probability and statistics, some algebra and calculus, an understanding of relational databases, the ability to program, and at least some computational thinking.
According to Dr. White, the industrial world is shifting to a new trend, and for high school students to be on the right side of this new trend, his advice to them is:
So the very first step is measurement. If companies have existing data, then they should start looking at it and cleaning it. If they don’t have existing data, then they need to start collecting it. >> I think to look for a team who love to work as data scientists. >> The first step is to have employees who are interested in data science, because if you don’t have interest in your company, you will not have engagement. >> Companies should remember that it’s key to have a team. So it’s not one data scientist, but a team of them, each of whom has strengths in different areas of data science.
The ultimate purpose of analytics is to communicate findings to the concerned parties, who might use these insights to formulate policy or strategy. Analytics summarizes findings in tables and plots. The data scientist should then use the insights to build a narrative to communicate the findings. In academia, the final deliverable is in the form of essays and reports. Such deliverables are usually 1,000 to 7,000 words in length. In consulting and business, the final deliverable takes on several forms. It can be a small document of fewer than 1,500 words illustrated with tables and plots, or it could be a comprehensive document comprising several hundred pages. Large consulting firms, such as McKinsey and Deloitte, routinely generate analytics-driven reports to communicate their findings and, in the process, establish their expertise in specific knowledge domains.
Curiosity is one of the most important skills a data scientist should have, in addition to a sense of humor and storytelling ability.
When companies are hiring people for a data science team, maybe a data scientist or an analyst, or a chief data scientist, the tendency is to look for the person who has all the skills: they know the domain-specific knowledge, they’re excellent at analyzing structured and unstructured data, and they’re great at presenting and have great storytelling skills. If you put all this together, you realize you’re looking for a unicorn, and your odds of finding a unicorn are pretty slim. I think what you need to do is to see, given the pool of applicants you have, who has the most resonance with your firm’s DNA. Because you can teach analytics skills; anyone can learn analytics skills if they dedicate time and effort to it. But what really matters is who’s passionate about the kind of business that you do. Someone could be a great data scientist in a retail environment, but they may not be that excited about working in IT-related firms or working with gigabytes of weblogs. But if someone is excited about those weblogs, if someone is excited about health-related data, then they would be able to contribute to your productivity much more. And I would say, if I’m looking for someone, if I have to put together a data science team, I would first look for curiosity. Is that person curious about things, not just data science but anything: are they curious about why this room is painted a certain way, why the bookshelves have books, and what kinds of books? They have to have a certain degree of curiosity about everything in their vision, everything they look at. The second thing is, do they have a sense of humor? Because, you see, you have to have a lighthearted attitude about it. If someone is too serious, they would probably take it too seriously and not be able to look at the lighter elements.
The third thing, and I think the last thing that I would look for if I had to have a hierarchy, is technical skills. I would go through the social skills, curiosity, and sense of humor; the ability to tell a story; the ability to know that there is a story there. And then once all that is there, I would say, well, can you do the technical side of it? And if there is some hope or some sign of technical skills, I would take them, because I can train them in whatever skills they need. But I cannot teach curiosity. I cannot teach storytelling. And I certainly cannot instill a sense of humor in anyone. >> I think there’s no hard and fast rule for hiring data scientists. I think it’s going to be a case-by-case thing. I would say there has to be some sort of technical component; somebody should be able to work with and manipulate the data. They should be able to communicate what they find in the data. I find quite often nobody really cares about the r-squared or the confidence interval, so you have to be able to introduce those things and explain something in a compelling way. And they also have to find somebody who is relatable, because data science,
it being typically new, means that the person in that role has to build relationships and work across different departments. >> The data scientist should have a good mathematics and statistics background. >> They have to consider problem-solving abilities and analysis; the scientist needs to be good at analyzing problems. >> The persons they are hiring should love to play with data, know how to work with data visualization, and have analytical thinking. >> When a company is hiring anyone to work on a data science team, they need to think about what role that person is going to take. Before a company begins, they need to understand what they want out of their data science team, and then hire for it. As they grow a data science team, they need to understand whether they need engineers, architects, or designers to work on visualization, or whether they just need more people who can multiply large matrices. >> From a skills point of view, let’s focus on the technical skills. In that case, the first thing would be: what kind of technical platform would you like to adopt? Let’s say you want to work in a structured data environment, and let’s say you want to work in market research. Then the type of skills you need is slightly different from someone who would like to work in big data environments. If you want to work with traditional market research data, a structured data environment, your skills should include some statistical knowledge and some knowledge of basic statistical algorithms, maybe some machine learning algorithms. And these are the tools you would like to develop. If you want to work in big data, then there’s the other aspect of it, and that is being able to store data. So you start with expertise in storing large amounts of data, and then you look into platforms that allow you to do that.
The next step would be to be able to manipulate large amounts of data, and the final step would be to apply algorithms to those large data sets. So it’s a three-step process. But most importantly, it starts with where you would like to be, in what field, in what domain. In terms of platforms, let’s say you want to be in the traditional predictive analytics environment and you’re not working with big data; then R, Stata, or Python would be your tools. If you’re working mostly with unstructured data, then Python is more suitable than R. If you’re working with big data, then Hadoop and Spark are the environments you will be working with. So it all depends on where you would like to be and what kind of work excites you, and then you pick your tools. In addition to technical skills, the second aspect of data science is the ability to communicate: communication skills or presentation skills. I call them storytelling skills: you have your analysis done; now can you tell a great story from it? If you have a very large table, can you synthesize it and make it more appealing, so that when it goes on the screen, or is part of a document, it just speaks? It sings the findings, and the reader gets it right there. So the ability to present your findings, either verbally, in a presentation, or in a document. Those communication and presentation skills are equally as important as the technical skills. When you have a great finding and you’re presenting your results, imagine you’re driving on a mountain and there’s a sharp turn, and you can’t see what’s beyond the turn. And then you make that turn and suddenly you see a tremendous valley in front of you, and there’s this great sense of awe: I didn’t know that, right? So when you present your findings, and you have this great finding and you communicate it well, this is what people feel, because they were not expecting it.
They were not aware of it, and then there’s this great sense of happiness: I didn’t know this, and now I know. It empowers them; it gives them ideas of what they can do with this knowledge, this new insight. It’s a great sense of joy, and as a data scientist you are able to share it with your clients, because you enabled it.
Data science requires programming.
Visual programming
Open source
Commercial software - leverage open source software
Cloud computing
Python, R, SQL (recommended)
Scala, Java, C++, Julia
It depends on what problems you need to solve.
It is a:
Python in Data Science
Learning up to three languages can increase your salary.
It is not open source like Python; rather, it is free software.
Easy to translate from math to code. It is popular in academia. It integrates well with other computer languages. And has stronger object-oriented programming facilities than most statistical computing languages.
How it works:
+ a non-procedural language
+ scope is limited to querying and managing data
A combination of clauses, expressions, predicates, queries, and statements.
What makes SQL great
SQL databases available:
+ MySQL
+ PostgreSQL
+ SQLite
+ Oracle
+ IBM DB2
+ MariaDB
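SQLite, one of the databases listed above, ships with Python's standard library, so the SQL building blocks mentioned earlier (clauses, expressions, predicates) can be tried without installing anything. A minimal sketch; the table and data are invented for illustration:

```python
# Demonstrate a SQL query built from clauses, an expression, and a predicate,
# using Python's built-in sqlite3 module with an in-memory database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ada", 120.0), ("Grace", 140.0), ("Alan", 90.0)])

# SELECT ... FROM ... WHERE ... ORDER BY are clauses; salary > 100 is a
# predicate; salary * 12 is an expression computed per row.
rows = conn.execute(
    "SELECT name, salary * 12 FROM employees WHERE salary > 100 ORDER BY name"
).fetchall()
print(rows)
```

The same query text would run, with minor dialect differences, on any of the databases above; only the connection code changes.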
+ TensorFlow.js
+ R-js: makes linear algebra possible in TypeScript
The ones with green labels can be done via cloud service.
In part one of this two-part series, we’ll cover data management, and open source data integration, transformation, and visualization tools. The most widely used open source data management tools are relational databases such as MySQL and PostgreSQL; NoSQL databases such as MongoDB, Apache CouchDB, and Apache Cassandra; and file-based tools such as the Hadoop File System or cloud file systems like Ceph. Finally, Elasticsearch is mainly used for storing text data and creating a search index for fast document retrieval. The task of data integration and transformation in the classic data warehousing world is called ETL, which stands for “extract, transform, and load.” These days, data scientists often propose the term “ELT” – extract, load, transform – stressing the fact that data is dumped somewhere and the data engineer or data scientist themselves is responsible for transforming it. Another term for this process has now emerged: “data refinery and cleansing.” Here are the most widely used open source data integration and transformation tools: Apache Airflow, originally created by Airbnb; KubeFlow, which enables you to execute data science pipelines on top of Kubernetes; Apache Kafka, which originated at LinkedIn; Apache Nifi, which delivers a very nice visual editor; Apache SparkSQL, which enables you to use ANSI SQL and scales up to compute clusters of thousands of nodes; and NodeRED, which also provides a visual editor. NodeRED consumes so few resources that it even runs on small devices like a Raspberry Pi. We’ll now introduce the most widely used open source data visualization tools. We have to distinguish between programming libraries, where you need to write code, and tools that provide a user interface. The most popular libraries are covered in the next videos. A similar approach is used by Hue, which can create visualizations from SQL queries. Kibana, a data exploration and visualization web application, is limited to Elasticsearch as the data provider.
Finally, Apache Superset is a data exploration and visualization web application. Model deployment is extremely important. Once you’ve created a machine learning model capable of predicting some key aspects of the future, you should make that model consumable by other developers and turn it into an API. Apache PredictionIO currently only supports Apache Spark ML models for deployment, but support for all sorts of other libraries is on the roadmap. Seldon is an interesting product since it supports nearly every framework, including TensorFlow, Apache SparkML, R, and scikit-learn. Seldon can run on top of Kubernetes and Red Hat OpenShift. Another way to deploy SparkML models is by using MLeap. Finally, TensorFlow can serve any of its models using TensorFlow Serving. You can deploy to an embedded device like a Raspberry Pi or a smartphone using TensorFlow Lite, and even deploy to a web browser using TensorFlow.js. Model monitoring is another crucial step. Once you’ve deployed a machine learning model, you need to keep track of its prediction performance as new data arrives, in order to catch and replace outdated models. Here are some examples of model monitoring tools: ModelDB is a machine learning model metadata database where information about the models is stored and can be queried. It natively supports Apache Spark ML Pipelines and scikit-learn. A generic, multi-purpose tool called Prometheus is also widely used for machine learning model monitoring, although it’s not specifically made for this purpose. Model performance is not exclusively measured through accuracy: model bias against protected groups like gender or race is also important. The IBM AI Fairness 360 open source toolkit does exactly this; it detects and mitigates bias in machine learning models.
Machine learning models, especially neural-network-based deep learning models, can be subject to adversarial attacks, where an attacker tries to fool the model with manipulated data or by manipulating the model itself. The IBM Adversarial Robustness 360 Toolbox can be used to detect vulnerability to adversarial attacks and help make the model more robust. Machine learning models are often considered to be black boxes that apply some mysterious “magic.” The IBM AI Explainability 360 Toolkit makes the machine learning process more understandable by finding similar examples within a dataset that can be presented to a user for manual comparison. It can also explain a model by training a simpler machine learning model and showing how different input variables affect its final decision. Options for code asset management tools have been greatly simplified: for code asset management – also referred to as version management or version control – Git is now the standard. Multiple services have emerged to support Git, the most prominent being GitHub, which provides hosting for software development version management. The runner-up is definitely GitLab, which has the advantage of being a fully open source platform that you can host and manage yourself. Another choice is Bitbucket. Data asset management, also known as data governance or data lineage, is another crucial part of enterprise-grade data science. Data has to be versioned and annotated with metadata. Apache Atlas is a tool that supports this task. Another interesting project, ODPi Egeria, is managed through the Linux Foundation and is an open ecosystem; it offers a set of open APIs, types, and interchange protocols that metadata repositories use to share and exchange data. Finally, Kylo is an open source data lake management software platform that provides extensive support for a wide range of data asset management tasks.
This concludes part one of this two-part series. Now let’s move on to part two.
In this section, we’ll cover development environments and open source data integration, transformation, and visualization tools. One of the most popular current development environments that data scientists use is Jupyter. Jupyter first emerged as a tool for interactive Python programming; it now supports more than a hundred different programming languages through “kernels.” Kernels shouldn’t be confused with operating system kernels: Jupyter kernels encapsulate the interactive interpreters for the different programming languages. A key property of Jupyter Notebooks is the ability to unify documentation, code, output from the code, shell commands, and visualizations into a single document. JupyterLab is the next generation of Jupyter Notebooks and, in the long term, will actually replace Jupyter Notebooks. The architectural changes being introduced in JupyterLab make Jupyter more modern and modular. From a user’s perspective, the main difference introduced by JupyterLab is the ability to open different types of files, including Jupyter Notebooks, data, and terminals, and then arrange these files on the canvas. Although Apache Zeppelin has been fully reimplemented, it’s inspired by Jupyter Notebooks and provides a similar experience. One key differentiator is the integrated plotting capability: in Jupyter Notebooks you are required to use external libraries, whereas in Apache Zeppelin plotting doesn’t require coding. You can also extend these capabilities by using additional libraries. RStudio is one of the oldest development environments for statistics and data science, having been introduced in 2011. It runs R and all associated R libraries; Python development is also possible, and R is tightly integrated into this tool to provide an optimal user experience. RStudio unifies programming, execution, debugging, remote data access, data exploration, and visualization into a single tool.
Spyder tries to mimic the behaviour of RStudio to bring its functionality to the Python world. Although Spyder does not have the same level of functionality as RStudio, data scientists do consider it an alternative; but in the Python world, Jupyter is used more frequently. This diagram shows how Spyder integrates code, documentation, visualizations, and other components into a single canvas. Sometimes your data doesn’t fit into a single computer’s storage or main memory capacity. That’s where cluster execution environments come in. The well-known cluster-computing framework Apache Spark is among the most active Apache projects and is used across all industries, including in many Fortune 500 companies. The key property of Apache Spark is linear scalability: if you double the number of servers in a cluster, you’ll also roughly double its performance. After Apache Spark began to gain market share, Apache Flink was created. The key difference between the two is that Apache Spark is a batch data processing engine, capable of processing huge amounts of data file by file, whereas Apache Flink is a stream processing engine, with its main focus on processing real-time data streams. Although each engine supports both data processing paradigms, Apache Spark is usually the choice in most use cases. One of the latest developments in data science execution environments is “Ray,” which has a clear focus on large-scale deep learning model training. Let’s look at open source tools for data scientists that are fully integrated and visual. With these tools, no programming knowledge is necessary. The most important tasks are supported by these tools, including data integration, transformation, data visualization, and model building. KNIME originated at the University of Konstanz in 2004. As you can see, KNIME has a visual user interface with drag-and-drop capabilities. It also has built-in visualization capabilities.
KNIME can be extended by programming in R and Python, and has connectors to Apache Spark. Another example of this group of tools is Orange. It’s less flexible than KNIME, but easier to use. In this video, you’ve learned about the most common data science tasks and which open source tools are relevant to those tasks. In the next video, we’ll describe some established commercial tools that you’ll encounter in your data science experience.
When we focus on commercial data integration tools, we’re talking about “ETL” tools (cf. the Gartner Magic Quadrant):
+ Informatica PowerCenter
+ IBM InfoSphere DataStage
+ SAP
+ Oracle
+ SAS
+ Talend
+ Microsoft
+ Watson Studio Desktop
In commercial environments, data visualizations typically rely on business intelligence, or “BI,” tools.
When asking “How do different columns in a table relate to each other?” - Watson Studio Desktop
SaaS - Software as a Service - the cloud provider operates the tool for you in the cloud.
e.g.
When it comes to commercial data integration tools, we talk not only about “extract, transform, and load,” or “ETL” tools, but also about “extract, load, and transform,” or “ELT,” tools. This means the transformation steps are not done by a data integration team but are pushed towards the domain of the data scientist or data engineer.
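The ETL/ELT steps described above can be sketched end to end with only the standard library. This is a toy illustration, not any particular tool's API; the file contents and the Celsius-to-Fahrenheit rule are invented for the example:

```python
# Toy ETL pipeline: extract from a CSV source, transform the units,
# load into a target database (an in-memory SQLite table).
import csv
import io
import sqlite3

raw = "city,temp_c\nBerlin,20\nOslo,5\n"          # extract: the raw source data

rows = list(csv.DictReader(io.StringIO(raw)))     # parse the CSV into records
for r in rows:                                    # transform: Celsius -> Fahrenheit
    r["temp_f"] = float(r["temp_c"]) * 9 / 5 + 32

db = sqlite3.connect(":memory:")                  # load: write into the target
db.execute("CREATE TABLE weather (city TEXT, temp_f REAL)")
db.executemany("INSERT INTO weather VALUES (?, ?)",
               [(r["city"], r["temp_f"]) for r in rows])
print(db.execute("SELECT temp_f FROM weather WHERE city='Berlin'").fetchone()[0])
```

In the ELT variant, the raw CSV would be loaded into the database first and the unit conversion done afterwards, by the data scientist, inside the target system.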
Scientific computing Libraries in Python
Libraries can sometimes be called “frameworks”.
Visualization Libraries in Python
High Level- Machine Learning and Deep Learning (meaning that you don’t have to worry about the details, which also means that it is hard to improve)
Deep Learning Libraries in Python
Apache Spark: process data in parallel
Scala
R - has been the de-facto standard for open source data science, but it is now being superseded by Python.
What is an API: lets two pieces of software talk to each other.
API Libraries + TensorFlow
REST API
+ enables you to communicate using the Internet, taking advantage of storage, greater data access, AI algorithms, and many other resources
+ RE = Representational
+ S = State
+ T = Transfer
+ your program = the client
An API lets two pieces of software talk to each other. For example, you have your program, you have some data, and you have other software components; you use the API to communicate with the other software components. You don’t have to know how the API works internally; you just need to know its inputs and outputs. Remember, the API only refers to the interface, the part of the library that you see; the “library” refers to the whole thing. Consider the pandas library. Pandas is actually a set of software components, many of which are not even written in Python. You have some data and a set of software components, and you use the pandas API to process the data by communicating with the other software components. There can be a single software component at the back end, but a separate API for different languages. Consider TensorFlow, written in C++: there are separate APIs in Python, JavaScript, C++, Java, and Go. The API is simply the interface. There are also multiple volunteer-developed APIs for TensorFlow, for example in Julia, MATLAB, R, and Scala, among others. REST APIs are another popular type of API. They enable you to communicate using the internet, taking advantage of storage, greater data access, artificial intelligence algorithms, and many other resources. The RE stands for “Representational,” the S stands for “State,” and the T stands for “Transfer.” In REST APIs, your program is called the “client.” The API communicates with a web service that you call through the internet. A set of rules governs communication, input (the request), and output (the response). Here are some common API-related terms: you or your code can be thought of as the client; the web service is referred to as a resource; the client finds the service through an endpoint; the client sends a request to the resource, and the resource sends a response back to the client. HTTP methods are a way of transmitting data over the internet. We tell REST APIs what to do by sending a request.
The request is usually communicated through an HTTP message. The HTTP message usually contains a JSON file with instructions for the operation that we would like the service to perform. This operation is transmitted to the web service over the internet, and the service performs the operation. Similarly, the web service returns a response through an HTTP message, where the information is usually returned using a JSON file, and this information is transmitted back to the client. The Watson Speech to Text API is an example of a REST API: it converts speech to text. In the API call, you send a copy of the audio file to the API; this is called a post request. The API then sends back the text transcription of what the individual is saying; retrieving the transcription corresponds to a get request. The Watson Language Translator API provides another example: you send the text you would like translated to the API, and the API sends the translation back to you; in this case, we translate English to Spanish. In this video, we’ve discussed what an API is, API libraries, and REST APIs, including requests and responses.
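The structure of such a REST request can be sketched with Python's standard library. The endpoint URL and the JSON payload below are hypothetical (made up for illustration, not a real Watson or any other service), and the request object is built but deliberately never sent over the network:

```python
# Assemble (but do not send) an HTTP POST request whose body is a JSON
# message, as a REST client would before calling a web service.
import json
import urllib.request

payload = {"text": "Hello", "target": "es"}   # instructions for the service
req = urllib.request.Request(
    url="https://api.example.com/v1/translate",       # hypothetical endpoint
    data=json.dumps(payload).encode("utf-8"),         # JSON body of the message
    headers={"Content-Type": "application/json"},
    method="POST",                                    # an HTTP method
)
print(req.method, req.get_header("Content-type"))
print(json.loads(req.data)["text"])
```

Actually sending the request (e.g. with `urllib.request.urlopen(req)`) would return the HTTP response, whose body a real service would typically also encode as JSON.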
Community Data License Agreement
+ CDLA-Sharing: permission to use and modify data; publication only under the same terms
+ CDLA-Permissive: permission to use and modify data; no obligations
The basic architectural design of the Jupyter ecosystem: Jupyter implements a two-process model, with a kernel and a client. The client is the interface that lets the user send code to the kernel; the kernel executes the code and returns the result to the client for display. When using a Jupyter notebook, the client is the browser. Jupyter notebooks store your code, metadata, contents, and outputs. When saved, a notebook uses a .ipynb extension and a JSON structure. When you, the user, save it, it is sent from your browser to the notebook server, which saves the notebook file on disk as a JSON file with a .ipynb extension. The notebook server is responsible for saving and loading notebooks; the kernel is sent the cells of code when the user runs them. Jupyter also has an architecture for converting files to other formats, using a tool called nbconvert. For example, converting a notebook file into an HTML file goes through the following steps: the notebook is modified by a preprocessor, an exporter converts the notebook to the new file format, and a postprocessor works on the file produced by the exporter. After conversion, when you request the URL of the HTML file, Jupyter fetches the notebook, converts it to HTML, and displays it to you as an HTML file. You should now be familiar with: the two-process model implementation of Jupyter, how notebook servers communicate with kernels and clients, and the architectural design of how notebook files are converted to other formats.
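The JSON structure the notebook server writes to disk can be sketched directly. This is a minimal hand-built example assuming notebook format 4; real notebooks produced by Jupyter carry more metadata:

```python
# A minimal .ipynb document is just JSON: notebook-level metadata plus a
# list of cells, each with its type, source, and (for code cells) outputs.
import json

notebook = {
    "nbformat": 4, "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3"}},   # which kernel runs the code
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Documentation and code live side by side"]},
        {"cell_type": "code", "execution_count": 1, "metadata": {},
         "source": ["print(1 + 1)"],
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}]},
    ],
}

text = json.dumps(notebook)   # what the notebook server writes to disk
loaded = json.loads(text)     # what it reads back when the notebook is opened
print(len(loaded["cells"]), loaded["cells"][1]["cell_type"])
```

Because the on-disk format is plain JSON, tools like nbconvert can walk the `cells` list and re-emit each cell as HTML, LaTeX, or another format.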
There are thousands of interesting Jupyter notebooks available on the internet for you to learn from. One of the best sources is: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
It is important to note that you can download such notebooks to your local computer or import them into a cloud-based notebook tool, so that you can rerun, modify, and follow along with what’s explained in the notebook.
Very often, Jupyter notebooks are already shared in a rendered view. This means that you can look at them as if they were running locally on your machine. But sometimes, folks only share a link to the Jupyter file (which you can recognize by the *.ipynb extension). In this case, you can just grab the URL of that file and paste it into the NB-Viewer => https://nbviewer.jupyter.org/
The list above gives you a very nice start, with a huge collection of materials to explore. It’s therefore perhaps more useful to give you some pointers to interesting notebooks. As we have covered some toy examples with toy data in the labs, let me point to some work that uses these data and goes further down the road of data science. In addition, as we’ve covered the different tasks in data science, we’ll also provide an exemplar notebook for each of them.
First you start with exploratory data analysis, so this notebook is highly recommended to have a look at: https://nbviewer.jupyter.org/github/Tanu-N-Prabhu/Python/blob/master/Exploratory_data_Analysis.ipynb
For data integration / cleansing at a smaller scale, the python library pandas is often used. Please have a look at this notebook: https://towardsdatascience.com/data-cleaning-with-python-using-pandas-library-c6f4a68ea8eb
If you want to already experience what clustering is, have a look at this: https://nbviewer.jupyter.org/github/temporaer/tutorial_ml_gkbionics/blob/master/2%20-%20KMeans.ipynb
And finally, if you want to go for a more in-depth notebook on the iris dataset have a look here: https://www.kaggle.com/lalitharajesh/iris-dataset-exploratory-data-analysis
Git and GitHub are popular environments among developers and data scientists for performing version control of source code files and projects, and for collaborating with others. You can’t talk about Git and GitHub without a basic understanding of what version control is.
A version control system allows you to keep track of changes to your documents. This makes it easy for you to recover older versions of your document if you make a mistake, and it makes collaboration with others much easier. Here is an example to illustrate how version control works. Let’s say you’ve got a shopping list and you want your roommates to confirm the things you need and add additional items. Without version control, you’ve got a big mess to clean up before you can go shopping. With version control, you know EXACTLY what you need after everyone has contributed their ideas.
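The shopping-list example can be made concrete with Python's standard difflib module, which computes exactly the kind of "what changed between versions" view a version control system shows. The list contents are invented for the example:

```python
# Compare two versions of a shopping list and extract what was added,
# the way a version control diff would present it.
import difflib

v1 = ["milk", "eggs", "bread"]
v2 = ["milk", "eggs", "coffee", "bread"]   # a roommate added coffee

diff = list(difflib.unified_diff(v1, v2, fromfile="v1", tofile="v2",
                                 lineterm=""))
# Lines starting with a single "+" are additions (skip the "+++" file header).
added = [line[1:] for line in diff
         if line.startswith("+") and not line.startswith("+++")]
print(added)
```

Git's `git diff` output follows this same unified-diff convention, with `+` and `-` prefixes marking added and removed lines.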
Git is free and open source software distributed under the GNU General Public License. Git is a distributed version control system, which means that users anywhere in the world can have a copy of your project on their own computer; when they’ve made changes, they can sync their version to a remote server to share it with you. Git isn’t the only version control system out there, but the distributed aspect is one of the main reasons it’s become one of the most common version control systems available. Version control systems are widely used for things involving code, but you can also version control images, documents, and any number of file types. You can use Git without a web interface by using your command line interface, but GitHub is one of the most popular web-hosted services for Git repositories. Others include GitLab, Bitbucket, and Beanstalk. There are a few basic terms that you will need to know before you can get started. The SSH protocol is a method for secure remote login from one computer to another. A repository contains your project folders that are set up for version control. A fork is a copy of a repository. A pull request is the way you request that someone reviews and approves your changes before they become final. A working directory contains the files and subdirectories on your computer that are associated with a Git repository. There are a few basic Git commands that you will always use. When starting out with a new repository, you only need to create it once: either locally with the command “git init” and then pushing it to GitHub, or by cloning an existing repository with “git clone”.
“git add” moves changes from the working directory to the staging area. “git status” allows you to see the state of your working directory and the staged snapshot of your changes. “git commit” takes your staged snapshot of changes and commits them to the project. “git reset” undoes changes that you’ve made to the files in your working directory. “git log” enables you to browse previous changes to a project. “git branch” lets you create an isolated environment within your repository to make changes. “git checkout” lets you see and change existing branches. “git merge” lets you put everything back together again. To learn how to use Git effectively and begin collaborating with data scientists around the world, you will need to learn the essential commands. Luckily for us, GitHub has amazing resources available to help you get started. Go to try.github.io to download the cheat sheets and run through the tutorials. In the following modules, we’ll give you a crash course on setting up your local environment and getting started on a project.
git is a distributed version control system - meaning that collaborators anywhere in the world can each keep a full copy of your repository on their own computer and sync changes through a remote
git-reset - Reset current HEAD to the specified state.

In this video, we will learn how to create and merge a branch using the GitHub web interface. A branch is a snapshot of your repository to which you can make changes. It is a copy of the master branch and can be used to develop and test changes to the workflow before merging it back to the master branch. In Git and GitHub, there is a main branch. The main branch, which is called Master, is the one with deployable code and the official working version of your project. It is meant to be stable, and it is always advisable never to push untested code to master. Many times, we want to make changes to the code and workflow in the master branch. That is when we create a copy of the Master branch; let’s call it the child branch. We then copy the workflow to the child branch, and in the child branch, changes and experiments are done. We build and make edits, test the changes, and when we are satisfied with the changes, we merge the child branch back to the master branch, where we prepare the model for deployment. All of this is done outside of the main branch, and until we merge, no changes are made to the workflow as it existed before we branched. To ensure that changes done by one member do not impede or affect the flow of work of other members, multiple branches can be created and merged appropriately to master after the workflow is properly tested and approved.

To create branches in GitHub, let’s look at this repository. There is currently one branch in the repository. I want to make some changes, but I don’t want to alter the master in case something goes wrong, so we will create a branch. To do that, we will click the drop-down arrow and create a new branch. Let’s name it child branch and then click enter. The repository now has two branches, the Master and the Child branch. You can check this by selecting Child branch in the Branch selector drop-down list.
Whatever was in the Master branch was copied to the child branch, but we can add files in the child branch without adding any files to the master branch. To add a file, make sure Child branch is selected in the branch selector drop-down list. Click on Create new file. In the space provided, name the file; we will name it testchild.py and then add a few lines of code. We will print the statement “Inside child branch”. At the bottom of the screen, we will see a section called Commit new file. Commit messages are very important as they help keep track of the changes that were made. It is important to add a descriptive commit message so that other team members can understand it. Here we will add a commit message, Create testchild.py, then we will commit the new file. The file gets added to only the child branch. We can check this by going to the master branch by clicking ‘master’ from the Branch selector menu, and here we can see that the new file is not added to the master branch. After we have created the new file, tested it, and made sure that it is up to standards, we then want to merge the changes in the child branch so that they are reflected in the master branch. To merge the changes, we will first have to create a pull request, also known as a PR. A pull request, in simple terms, is a way to notify other team members of your changes and edits and ask them for review so the changes can be pulled, or merged, into the master branch. Pull requests are the heart of collaboration on GitHub. When you open a pull request, you’re proposing your changes and requesting that someone review and pull in your contribution and merge it into the target branch. Pull requests show the differences in the content from both branches. To open a pull request and see the differences between the branches, click on the Compare and pull request button. If you scroll down to the bottom of the screen, you will see something like this that shows you the difference between both branches.
As you can see on the screen, it shows that one file has changed and the file has two additions, which are the two lines we added to the file, and 0 deletions. We will now create the pull request. Add the title and an optional comment for the pull request, then click Create pull request. You can assign team members to review and approve pull requests. On the next page you will see this image. If you are okay with the changes, click on Merge pull request and then click Confirm. You will get a confirmation that the pull request has been successfully merged. You can now delete the branch if you no longer need to make any edits or add new information. Now, the child branch has completely merged with the Master branch. You can check the Master branch, and we can now see it contains the testchild.py file. You should now be familiar with how to create and merge branches using the web interface.
Every business wants to work smarter, and to do that you need to tap into your company’s greatest resource: your data. But extracting the full value out of your data isn’t always an easy process. First, you end up juggling an incredibly large and complex collection of tools that are used for finding and cleaning data, analyzing and generating visualizations of that data, and using the data to build and deploy machine learning models. And to make matters worse, these tools are often a time drain to individually manage and can be difficult to integrate into your system, which can really slow down the workflow. But not anymore. Using Watson Studio you can simplify your data projects with a streamlined process that allows you to extract value and insights from your data to help your business get smarter, faster. It delivers an easy-to-use, collaborative data science and machine learning environment for building and training models, preparing and analyzing data, and sharing insights, all in one place. Watson Studio’s easy-to-create visualizations and drag-and-drop code put the power of data-driven decision-making into the hands of any member of your organization, with no need for IT assistance. And if you need access to open source tools, the environment offers some of the most popular and powerful ones available. Watson Studio’s single environment also creates a workflow that’s incredibly efficient, so data scientists can share assets and work to solve problems within the system rather than starting from scratch every time a new issue arises. And developers can use that efficiency to quickly dive into building machine learning and deep learning algorithms. In fact, in the area of deep learning, Watson Studio supports some of the most popular frameworks and can deploy that deep learning onto the latest GPUs to help accelerate modeling by making it easier to use.
The environment’s built-in neural network modeler also helps you build models with a simplified graphical interface. Even if you don’t have the dedicated resources to build a model from scratch, Watson Studio can help you get started with modeling templates for areas such as visual recognition, language classification, and other tools from IBM Watson services. Because Watson Studio is seamlessly integrated with the IBM Watson Knowledge Catalog, an intelligent asset discovery tool, you can transform data and models into trusted enterprise resources and collaborate with confidence, without compromising compliance, security, or access control. Watson Studio provides many benefits for organizations, helping to infuse AI into the business and drive innovation. You can train Watson Studio with embedded AI services including Watson Visual Recognition. You can customize your models and deploy them as APIs or Core ML by using open source tools like Jupyter Notebook, Anaconda, and RStudio. Watson Studio supports most popular code libraries, as well as no-code visual modeling with the neural network modeler for designing neural architectures using the most popular deep learning frameworks. In Watson Studio you can interactively discover, cleanse, and transform your data using Data Refinery. It helps you understand the quality and distribution of your data with built-in charts and statistics, and provides visualized results through interactive dashboards. Watson Studio includes an intuitive drag-and-drop interface that enables a non-programmer to speed up the model-building process by visually selecting, configuring, designing, and auto-coding neural networks. From development and training to production and evaluation, Watson Studio tracks your models over time to ensure you have the best performance for any given task, using the best solutions across the entire lifecycle of your machine learning models.
Image
A training set is a set of historical data in which the outcomes are already known. The training set acts like a gauge to determine whether the model needs to be calibrated. In this stage, the data scientist will try out different algorithms to ensure that the variables in play are actually required. The success of data compilation, preparation, and modeling depends on understanding the problem at hand and taking the appropriate analytical approach. The data supports the answering of the question and, like the quality of the ingredients in cooking, sets the stage for the outcome. Constant refinement, adjustment, and tweaking are necessary within each step to ensure the outcome is solid. In John Rollins’ descriptive Data Science Methodology, the framework is geared to do three things: first, understand the question at hand; second, select an analytic approach or method to solve the problem; and third, obtain, understand, prepare, and model the data. The end goal is to move the data scientist to a point where a data model can be built to answer the question.
Python is a popular and powerful general-purpose programming language that recently emerged as the preferred language among data scientists. You can write your machine-learning algorithms using Python, and it works very well. However, there are a lot of modules and libraries already implemented in Python that can make your life much easier. We introduce these Python packages in this course and use them in the labs to give you better hands-on experience. The first package is NumPy, which is a math library to work with N-dimensional arrays in Python. It enables you to do computation efficiently and effectively. It is better than regular Python because of its amazing capabilities: for example, for working with arrays, dictionaries, functions, data types, and images, you need to know NumPy. SciPy is a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more. SciPy is a good library for scientific and high-performance computation. Matplotlib is a very popular plotting package that provides 2D plotting as well as 3D plotting. Basic knowledge of these three packages, which are built on top of Python, is a good asset for data scientists who want to work with real-world problems. If you’re not familiar with these packages, I recommend that you take the Data Analysis with Python course first. That course covers most of the useful topics in these packages. The Pandas library is a very high-level Python library that provides high-performance, easy-to-use data structures. It has many functions for data importing, manipulation, and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. SciKit Learn is a collection of algorithms and tools for machine learning, which is our focus here and which you’ll learn to use within this course.
As we’ll be using SciKit Learn quite a bit in the labs, let me explain more about it and show you why it is so popular among data scientists. SciKit Learn is a free machine learning library for the Python programming language. It has most of the classification, regression, and clustering algorithms, and it’s designed to work with the Python numerical and scientific libraries NumPy and SciPy. Also, it includes very good documentation. On top of that, implementing machine learning models with SciKit Learn is really easy, taking just a few lines of Python code. Most of the tasks that need to be done in a machine learning pipeline are implemented already in SciKit Learn, including pre-processing of data, feature selection, feature extraction, train/test splitting, defining the algorithms, fitting models, tuning parameters, prediction, evaluation, and exporting the model. Let me show you an example of what SciKit Learn looks like when you use this library. You don’t have to understand the code for now, but just see how easily you can build a model with a few lines of code. Basically, machine-learning algorithms benefit from standardization of the dataset. If there are outliers or fields with different scales in your dataset, you have to fix them. The pre-processing package of SciKit Learn provides several common utility functions and transformer classes to change raw feature vectors into a form suitable for modeling. You have to split your dataset into train and test sets to train your model and then test the model’s accuracy separately. SciKit Learn can split arrays or matrices into random train and test subsets for you in one line of code. Then you can set up your algorithm. For example, you can build a classifier using a support vector classification algorithm. We call our estimator instance clf and initialize its parameters.
Now you can train your model with the train set: by passing our training set to the fit method, the clf model learns to classify unknown cases. Then we can use our test set to run predictions, and the result tells us what the class of each unknown value is. Also, you can use different metrics to evaluate your model’s accuracy, for example, using a confusion matrix to show the results. And finally, you save your model. You may find all or some of these machine-learning terms confusing, but don’t worry, we’ll talk about all of these topics in the following videos. The most important point to remember is that the entire process of a machine learning task can be done simply in a few lines of code using SciKit Learn. Please notice that though it is possible, it would not be that easy if you wanted to do all of this using the NumPy or SciPy packages, and of course it would need much more coding if you used pure Python to implement all of these tasks. Thanks for watching.
Difference between AI, ML, and DL
Python for Machine Learning
Image
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
from sklearn import svm
clf = svm.SVC(gamma=.001, C=100.)
clf.fit(X_train, y_train)
yhat = clf.predict(X_test)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, yhat, labels=[1, 0]))
import pickle
s = pickle.dumps(clf) # save model
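The serialized model can later be restored with `pickle.loads`. A minimal round-trip sketch, using a trivial stand-in object since `clf` is only defined inside the lab:

```python
import pickle

model = {"coef": [1.0, -2.0]}  # stand-in for a fitted estimator
s = pickle.dumps(model)        # serialize the object to bytes
restored = pickle.loads(s)     # restore an equivalent object
print(restored == model)       # the round trip preserves the object
```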
Image
Regression algorithms
Training Accuracy
* High training accuracy isn’t necessarily a good thing
* It can be the result of over-fitting
* Over-fit: the model is overly trained to the dataset, which may capture noise and produce a non-generalized model

Out-of-Sample Accuracy
* It’s important that our models have a high out-of-sample accuracy
* It can be improved using a train/test split
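A minimal sketch of the over-fitting symptom described above, using a synthetic dataset and an unrestricted decision tree (both are illustrative choices, not from the lab):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (made up for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unrestricted tree memorizes the training set, noise included
deep_tree = DecisionTreeRegressor().fit(X_train, y_train)

print("Training R^2:     ", deep_tree.score(X_train, y_train))  # near-perfect
print("Out-of-sample R^2:", deep_tree.score(X_test, y_test))    # noticeably lower
```

The large gap between training and out-of-sample scores is exactly the non-generalized model the bullet points warn about.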
MAE \[ MAE = \frac{1}{n}\Sigma_{j=1}^{n} \vert y_j - \hat y_j\vert \]
MSE \[ MSE = \frac{1}{n}\Sigma_{j=1}^{n} ( y_j - \hat y_j)^2 \]
MSE is more popular than MAE because the squared term emphasizes larger errors.
RMSE
\[ RMSE = \sqrt{\frac{1}{n}\Sigma_{j=1}^{n} ( y_j - \hat y_j)^2} \]
RAE (Relative Absolute Error): the total absolute error normalized by the total absolute error of the simple predictor \(\bar y\) (the mean of the actual values).
\[ RAE = \frac{\Sigma_{j=1}^{n} \vert y_j - \hat y_j\vert}{\Sigma_{j=1}^{n} \vert y_j - \bar y\vert} \]
RSE (Relative Squared Error): the squared analogue of RAE; it is widely adopted because \(R^2 = 1 - RSE\).
\[ RSE = \frac{\Sigma_{j=1}^{n} (y_j - \hat y_j)^2}{\Sigma_{j=1}^{n} ( y_j - \bar y)^2} \]
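These formulas translate directly into NumPy. A quick sketch on made-up numbers (the `y` and `y_hat` values are invented for illustration):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])      # actual values (made-up)
y_hat = np.array([2.5, 5.0, 8.0, 8.5])  # predictions (made-up)

mae = np.mean(np.abs(y - y_hat))                            # Mean Absolute Error
mse = np.mean((y - y_hat) ** 2)                             # Mean Squared Error
rmse = np.sqrt(mse)                                         # Root Mean Squared Error
rae = np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y - y.mean()))   # Relative Absolute Error
rse = np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)     # Relative Squared Error
r2 = 1 - rse                                                # R² is 1 minus RSE

print(mae, mse, rmse, rae, rse, r2)
```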
See relation:
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
Creating train and test dataset:
msk = np.random.rand(len(df)) < 0.8
print(msk)
train = cdf[msk]
test = cdf[~msk]
Modeling:
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
coef = regr.coef_[0][0]
inter = regr.intercept_[0]
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-', color='orange')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.annotate('y = {0:.0f} x + {1:.0f}'.format(coef, inter), xy=(5, 200))
plt.show()
Image
Evaluation:
from sklearn.metrics import r2_score
test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_y_ = regr.predict(test_x)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y , test_y_) )
Modeling:
from sklearn import linear_model
regr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit (x, y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print(f'Intercept: {regr.intercept_}')
Prediction:
y_hat= regr.predict(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
x = np.asanyarray(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(test[['CO2EMISSIONS']])
print("Residual sum of squares: %.2f"
% np.mean((y_hat - y) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.6f' % regr.score(x, y))
Create a train and a test dataset:
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
poly = PolynomialFeatures(degree=2)
train_x_poly = poly.fit_transform(train_x)
train_x_poly
fit_transform takes our x values and outputs an array of our data raised from the power of 0 to the power of 2 (since we set the degree of our polynomial to 2).
From \(y = b + \theta_1 x + \theta_2 x^2\) to \(y = b + \theta_1 x_1 + \theta_2 x_2\).
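To see what this produces, here is a tiny sketch with made-up engine sizes (the values 2.0 and 3.0 are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])  # toy engine sizes (made-up)
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(x))
# Each row becomes [x^0, x^1, x^2]:
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

Linear regression fitted on these columns is exactly the polynomial model \(y = b + \theta_1 x + \theta_2 x^2\).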
clf = linear_model.LinearRegression()
train_y_ = clf.fit(train_x_poly, train_y)
# The coefficients
print ('Coefficients: ', clf.coef_)
print ('Intercept: ',clf.intercept_)
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
XX = np.arange(0.0, 10.0, 0.1)
yy = clf.intercept_[0]+ clf.coef_[0][1]*XX+ clf.coef_[0][2]*np.power(XX, 2)
plt.plot(XX, yy, '-r' )
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
Image
Evaluation:
from sklearn.metrics import r2_score
test_x_poly = poly.fit_transform(test_x)
test_y_ = clf.predict(test_x_poly)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.6f" % r2_score(test_y,test_y_ ) )
Check relation:
plt.figure(figsize=(8,5))
x_data, y_data = (df["Year"].values, df["Value"].values)
plt.plot(x_data, y_data, 'ro')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
Image
Choose a logistic function that fits:
\[ \hat{Y} = \frac1{1+e^{-\beta_1(X-\beta_2)}}\]
X = np.arange(-5.0, 5.0, 0.1)
Y = 1.0 / (1.0 + np.exp(-X))
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
Image
def sigmoid(x, Beta_1, Beta_2):
    y = 1 / (1 + np.exp(-Beta_1 * (x - Beta_2)))
    return y
Find the best parameters for our model:
# Let's normalize our data first
from scipy.optimize import curve_fit
xdata = x_data / max(x_data)
ydata = y_data / max(y_data)
popt, pcov = curve_fit(sigmoid, xdata, ydata)
#print the final parameters
print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))
x = np.linspace(1960, 2015, 55)
x = x/max(x)
plt.figure(figsize=(8,5))
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
Image
Classification Algorithms
Decision trees
Naive Bayes
Linear Discriminant Analysis
k-Nearest Neighbor
Logistic Regression
Neural Networks
Support Vector Machines (SVM)
comparing actual labels \(y\) with predicted labels \(\hat y\)
Higher accuracy = higher Jaccard index
An F1-score close to 1 means the classifier is closer to ideal.
Output is a probability value between 0 and 1.
A classifier with lower log loss has better accuracy.
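A quick sketch of these three metrics on made-up labels, using the scikit-learn functions that appear later in the labs (all the label and probability values here are invented for illustration):

```python
import numpy as np
from sklearn.metrics import jaccard_score, f1_score, log_loss

y_true = np.array([1, 1, 0, 0, 1])            # actual labels (made-up)
y_pred = np.array([1, 0, 0, 0, 1])            # hard predictions (made-up)
y_prob = np.array([0.9, 0.4, 0.2, 0.1, 0.8])  # predicted P(class=1) (made-up)

print("Jaccard:", jaccard_score(y_true, y_pred))  # intersection/union of positives
print("F1:", f1_score(y_true, y_pred))            # harmonic mean of precision and recall
print("Log loss:", log_loss(y_true, y_prob))      # lower is better
```

Note that Jaccard and F1 only look at the hard labels, while log loss rewards well-calibrated probabilities.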
Algorithms
import numpy as np
import pandas as pd
import mgt2001
from sklearn import preprocessing
from matplotlib import pyplot as plt
import matplotlib.cm as cm
import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator
import matplotlib.mlab as mlab
%matplotlib inline
plt.style.use('ggplot') # refined style
# Normalize Data first
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
k = 4
#Train Model and Predict
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
yhat = neigh.predict(X_test)
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
For other values of \(k\):
Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
for n in range(1, Ks):
    # Train Model and Predict
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    std_acc[n-1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])
mean_acc
plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1,Ks),mean_acc - 3 * std_acc,mean_acc + 3 * std_acc, alpha=0.10,color="green")
plt.legend(('Accuracy ', '+/- 1xstd','+/- 3xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()
Image
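The best \(k\) can then be read off as the index of the largest mean accuracy. A self-contained sketch (the `mean_acc` values here are illustrative stand-ins for the loop’s output):

```python
import numpy as np

# mean_acc as produced by the loop above (values are illustrative)
mean_acc = np.array([0.70, 0.78, 0.82, 0.80, 0.77, 0.79, 0.75, 0.74, 0.76])

best_k = mean_acc.argmax() + 1  # index 0 corresponds to k = 1
print("The best accuracy was", mean_acc.max(), "with k =", best_k)
```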
Example:
Building Decision Trees
Choose the attribute that has:
* more predictiveness
* less impurity
* lower entropy
After trying out every attribute in the dataset, how do we determine which attribute is the best?
Image
Answer: The tree with the higher information gain after splitting.
Information gain: the information that can increase the level of certainty after splitting \(\text{Information Gain} = \text{(Entropy before split)} - \text{(weighted entropy after split)}\)
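A minimal sketch of this computation on a made-up binary split (the sample counts are invented for illustration):

```python
import numpy as np

def entropy(p):
    """Entropy of a binary node with proportion p of one class."""
    if p in (0, 1):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Made-up example: 14 samples (9 positive) split into two branches
before = entropy(9 / 14)                    # entropy before the split
left = entropy(6 / 8)                       # branch with 8 samples, 6 positive
right = entropy(3 / 6)                      # branch with 6 samples, 3 positive
after = (8 / 14) * left + (6 / 14) * right  # weighted entropy after the split

print("Information gain:", before - after)
```

The attribute whose split yields the largest such gain is the one the tree chooses.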
from sklearn.tree import DecisionTreeClassifier
Sklearn decision trees do not handle categorical variables, so they must be converted to numerical values first, for example with LabelEncoder:
from sklearn import preprocessing
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1] = le_sex.transform(X[:,1])
le_BP = preprocessing.LabelEncoder()
le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])
X[:,2] = le_BP.transform(X[:,2])
le_Chol = preprocessing.LabelEncoder()
le_Chol.fit([ 'NORMAL', 'HIGH'])
X[:,3] = le_Chol.transform(X[:,3])
X[0:5]
Setting up the Decision Tree:
from sklearn.model_selection import train_test_split
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)
Modeling
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
# drugTree # it shows the default parameters
drugTree.fit(X_trainset,y_trainset)
Making Prediction
predTree = drugTree.predict(X_testset)
print (predTree [0:5])
print (y_testset [0:5])
Evaluation
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))
Visualization
from io import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline
dot_data = StringIO()
filename = "drugtree.png"
featureNames = my_data.columns[0:5]
targetNames = my_data["Drug"].unique().tolist()
out=tree.export_graphviz(drugTree,feature_names=featureNames, out_file=dot_data, class_names= np.unique(y_trainset), filled=True, special_characters=True,rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')
plt.show()
Image
X = np.asarray(churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
X[0:5]
y = np.asarray(churn_df['churn'])
y [0:5]
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
The C parameter indicates the inverse of regularization strength and must be a positive float; smaller values specify stronger regularization. predict_proba returns estimates for all classes, ordered by the label of classes.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
yhat = LR.predict(X_test)
yhat
yhat_prob = LR.predict_proba(X_test)
yhat_prob
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat,pos_label=0)
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0]) # result
# np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')
print (classification_report(y_test, yhat))
from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)
Lower log loss means better accuracy.
Trying different solvers:
# write your code here
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
lls = list()
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
for solver in solvers:
    LR = LogisticRegression(C=0.01, solver=solver).fit(X_train, y_train)
    yhat = LR.predict(X_test)
    yhat_prob = LR.predict_proba(X_test)
    print(yhat)
    ll = log_loss(y_test, yhat_prob)
    lls.append(ll)
    print(f'{solver}\'s Log Loss: {ll}')
print(f"{min(lls)} at {solvers[lls.index(min(lls))]}")
SVM is a supervised algorithm that classifies cases by finding a separator.
Basically, SVMs are based on the idea of finding a hyperplane that best divides a data set into two classes as shown here. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So the goal is to choose a hyperplane with as big a margin as possible. Examples closest to the hyperplane are support vectors. It is intuitive that only support vectors matter for achieving our goal, and thus other training examples can be ignored. We try to find the hyperplane in such a way that it has the maximum distance to the support vectors.
SVM is good for image analysis tasks, such as image classification and hand written digit recognition. Also, SVM is very effective in text mining tasks, particularly due to its effectiveness in dealing with high-dimensional data. For example, it is used for detecting spam, text category assignment and sentiment analysis. Another application of SVM is in gene expression data classification, again, because of its power in high-dimensional data classification. SVM can also be used for other types of machine learning problems, such as regression, outlier detection and clustering. I’ll leave it to you to explore more about these particular problems. This concludes this video, thanks for watching.
feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)
cell_df['Class'] = cell_df['Class'].astype('int')
y = np.asarray(cell_df['Class'])
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
Kernel types available in SVC:
['linear', 'poly', 'rbf', 'sigmoid', 'precomputed']
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
yhat = clf.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])
np.set_printoptions(precision=2)
print (classification_report(y_test, yhat))
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],normalize= False, title='Confusion matrix')
              precision    recall  f1-score   support

           2       1.00      0.94      0.97        90
           4       0.90      1.00      0.95        47

    accuracy                           0.96       137
   macro avg       0.95      0.97      0.96       137
weighted avg       0.97      0.96      0.96       137
Image
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted')
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat,pos_label=2)